transition region
Enhanced Sampling for Efficient Learning of Coarse-Grained Machine Learning Potentials
Chen, Weilong, Görlich, Franz, Fuchs, Paul, Zavadlav, Julija
Coarse-graining (CG) enables molecular dynamics (MD) simulations of larger systems and longer timescales that are otherwise infeasible with atomistic models. Machine learning potentials (MLPs), with their capacity to capture many-body interactions, can provide accurate approximations of the potential of mean force (PMF) in CG models. Current CG MLPs are typically trained in a bottom-up manner via force matching, which in practice relies on configurations sampled from the unbiased equilibrium Boltzmann distribution to ensure thermodynamic consistency. This convention poses two key limitations: first, sufficiently long atomistic trajectories are needed to reach convergence; and second, even once equilibrated, transition regions remain poorly sampled. To address these issues, we employ enhanced sampling to bias along CG degrees of freedom for data generation, and then recompute the forces with respect to the unbiased potential. This strategy simultaneously shortens the simulation time required to produce equilibrated data and enriches sampling in transition regions, while preserving the correct PMF. We demonstrate its effectiveness on the Müller-Brown potential and capped alanine, achieving notable improvements. Our findings support the use of enhanced sampling for force matching as a promising direction to improve the accuracy and reliability of CG MLPs.
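The core idea of the abstract above, sampling with a bias but labelling each configuration with the force of the unbiased potential, can be sketched on the Müller-Brown surface. This is a minimal illustration, not the authors' implementation: the parameters are the standard textbook Müller-Brown values, and the harmonic bias on x, the overdamped Langevin sampler, and the names `bias_force` and `sample_biased` are assumptions introduced here.

```python
import numpy as np

# Standard Müller-Brown parameters (textbook values, not given in the abstract).
A = np.array([-200.0, -100.0, -170.0, 15.0])
a = np.array([-1.0, -1.0, -6.5, 0.7])
b = np.array([0.0, 0.0, 11.0, 0.6])
c = np.array([-10.0, -10.0, -6.5, 0.7])
X0 = np.array([1.0, 0.0, -0.5, -1.0])
Y0 = np.array([0.0, 0.5, 1.5, 1.0])

def potential(p):
    """Unbiased Müller-Brown potential at point p = (x, y)."""
    dx, dy = p[0] - X0, p[1] - Y0
    return float(np.sum(A * np.exp(a * dx**2 + b * dx * dy + c * dy**2)))

def unbiased_force(p):
    """Negative analytic gradient of the unbiased potential."""
    dx, dy = p[0] - X0, p[1] - Y0
    e = A * np.exp(a * dx**2 + b * dx * dy + c * dy**2)
    gx = np.sum(e * (2 * a * dx + b * dy))
    gy = np.sum(e * (b * dx + 2 * c * dy))
    return -np.array([gx, gy])

def bias_force(p, center, k=50.0):
    """Force of a hypothetical harmonic bias 0.5*k*(x - center)^2 acting on x only."""
    return np.array([-k * (p[0] - center), 0.0])

def sample_biased(center, n_steps=500, dt=1e-4, kT=10.0, seed=0):
    """Overdamped Langevin dynamics on the biased surface (unit mobility assumed)."""
    rng = np.random.default_rng(seed)
    p = np.array([center, 0.5])
    configs = []
    for _ in range(n_steps):
        f = unbiased_force(p) + bias_force(p, center)  # the bias steers the sampling
        p = p + dt * f + np.sqrt(2 * kT * dt) * rng.standard_normal(2)
        configs.append(p.copy())
    return np.array(configs)

# Configurations come from the biased run; force labels exclude the bias.
configs = sample_biased(center=-0.8)
labels = np.array([unbiased_force(p) for p in configs])
```

The resulting `(configs, labels)` pairs would then serve as force-matching training data; how the bias is actually chosen and how the CG mapping enters are not specified in the abstract.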
RDSinger: Reference-based Diffusion Network for Singing Voice Synthesis
Sui, Kehan, Xiang, Jinxu, Jin, Fang
Singing voice synthesis (SVS) aims to produce high-fidelity singing audio from music scores and, unlike text-to-speech, requires a detailed understanding of notes, pitch, and duration. Although diffusion models have shown exceptional performance in generative tasks such as image and video creation, their application in SVS is hindered by time complexity and the challenge of capturing acoustic features, particularly during pitch transitions. Some networks learn from the prior distribution and use the compressed latent state as a better starting point for the diffusion model, but the denoising step does not consistently improve quality over the entire duration. We introduce RDSinger, a reference-based denoising diffusion network that generates high-quality audio for SVS tasks. Our approach is inspired by Animate Anyone, a diffusion image network that maintains intricate appearance features from reference images. RDSinger uses the FastSpeech2 mel-spectrogram as a reference to mitigate denoising step artifacts. Additionally, existing models can be influenced by misleading information in the compressed latent state during pitch transitions. We address this issue by applying Gaussian blur to parts of the reference mel-spectrogram and adjusting the loss weights in these regions. Extensive ablation studies demonstrate the efficiency of our method. Evaluations on OpenCpop, a Chinese singing dataset, show that RDSinger outperforms current state-of-the-art SVS methods.
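A rough sketch of the two operations named in the abstract above, blurring part of a reference mel-spectrogram and reweighting the loss in the same region, might look as follows. The abstract does not say along which axis the blur is applied, how transition frames are detected, or whether the weights are raised or lowered, so the time-axis blur, the boolean `transition_mask`, and the `transition_weight` parameter are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2*radius + 1."""
    t = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def blur_transitions(mel, transition_mask, sigma=2.0):
    """Blur a (frames, bins) mel-spectrogram along time, only in masked frames."""
    radius = int(3 * sigma)
    kernel = gaussian_kernel(sigma, radius)
    # Edge padding keeps the output the same length as the input.
    padded = np.pad(mel, ((radius, radius), (0, 0)), mode="edge")
    blurred = np.stack(
        [np.convolve(padded[:, j], kernel, mode="valid") for j in range(mel.shape[1])],
        axis=1,
    )
    out = mel.copy()
    out[transition_mask] = blurred[transition_mask]  # frames outside the mask stay untouched
    return out

def weighted_l1(pred, target, transition_mask, transition_weight=2.0):
    """L1 loss with a separate (assumed) weight on transition frames."""
    w = np.where(transition_mask, transition_weight, 1.0)[:, None]
    return float(np.sum(w * np.abs(pred - target)) / (np.sum(w) * pred.shape[1]))
```

Setting `transition_weight` below 1.0 would de-emphasize the transition frames instead; the parameter is left open because the abstract only says the weights are adjusted.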
Smooth Like Butter: Evaluating Multi-Lattice Transitions in Property-Augmented Latent Spaces
Baldwin, Martha, Meisel, Nicholas A., McComb, Christopher
Additive manufacturing has revolutionized structural optimization by enhancing component strength and reducing material requirements. One approach used to achieve these improvements is the application of multi-lattice structures, where the macro-scale performance relies on the detailed design of mesostructural lattice elements. Many current approaches to designing such structures use data-driven design to generate multi-lattice transition regions, making use of machine learning models that are informed solely by the geometry of the mesostructures. However, it remains unclear if the integration of mechanical properties into the dataset used to train such machine learning models would be beneficial beyond using geometric data alone. To address this issue, this work implements and evaluates a hybrid geometry/property Variational Autoencoder (VAE) for generating multi-lattice transition regions. In our study, we found that hybrid VAEs demonstrate enhanced performance in maintaining stiffness continuity through transition regions, indicating their suitability for design tasks requiring smooth mechanical properties.
How Do Neural Spoofing Countermeasures Detect Partially Spoofed Audio?
Liu, Tianchi, Zhang, Lin, Das, Rohan Kumar, Ma, Yi, Tao, Ruijie, Li, Haizhou
Partially manipulating a sentence can greatly change its meaning. Recent work shows that countermeasures (CMs) trained on partially spoofed audio can effectively detect such spoofing. However, the current understanding of the decision-making process of CMs is limited. We utilize Grad-CAM and introduce a quantitative analysis metric to interpret CMs' decisions. We find that CMs prioritize the artifacts of transition regions created when concatenating bona fide and spoofed audio. This focus differs from that of CMs trained on fully spoofed audio, which concentrate on the pattern differences between bona fide and spoofed parts. Our further investigation explains the varying nature of CMs' focus while making correct or incorrect predictions. These insights provide a basis for the design of CM models and the creation of datasets. Moreover, this work lays a foundation of interpretability in the field of partially spoofed audio detection, which has not been well explored previously.
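For readers unfamiliar with Grad-CAM, the map is a ReLU over a gradient-weighted sum of feature channels. The sketch below computes it from precomputed activations and gradients, so no deep learning framework is needed. The function `mass_in_region` is a hypothetical stand-in for a quantitative score; the abstract does not define the authors' metric, and this is not it.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map from feature maps and the gradients of a class score.

    activations, gradients: arrays of shape (channels, ...), e.g. (K, T) for
    audio frames or (K, H, W) for images.
    """
    # Channel weights: gradients averaged over all non-channel axes.
    weights = gradients.mean(axis=tuple(range(1, gradients.ndim)))
    cam = np.tensordot(weights, activations, axes=(0, 0))
    return np.maximum(cam, 0.0)  # ReLU keeps only positive evidence

def mass_in_region(cam, region_mask):
    """Hypothetical score: share of total CAM mass that falls inside a region."""
    total = cam.sum()
    return float(cam[region_mask].sum() / total) if total > 0 else 0.0
```

With a mask that marks the frames around a concatenation point, a score near 1 would indicate that the model attends mostly to the transition region.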
QueSTMaps: Queryable Semantic Topological Maps for 3D Scene Understanding
Mehan, Yash, Gupta, Kumaraditya, Jayanti, Rohit, Govil, Anirudh, Garg, Sourav, Krishna, Madhava
Understanding the structural organisation of 3D indoor scenes in terms of rooms is often accomplished via floorplan extraction. Robotic tasks such as planning and navigation require a semantic understanding of the scene as well. This is typically achieved via object-level semantic segmentation. However, such methods struggle to segment out topological regions like "kitchen" in the scene. In this work, we introduce a two-step pipeline. First, we extract a topological map, i.e., a floorplan of the indoor scene, using a novel multi-channel occupancy representation. Then, we generate CLIP-aligned features and semantic labels for every room instance based on the objects it contains using a self-attention transformer. Our language-topology alignment supports natural language querying, e.g., the query "place to cook" locates the "kitchen". We outperform the current state-of-the-art on room segmentation by ~20% and room classification by ~12%. Our detailed qualitative analysis and ablation studies provide insights into the problem of joint structural and semantic 3D scene understanding.
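Once each room has a feature vector in the same space as a text embedding, natural language querying can be as simple as a cosine-similarity lookup. The sketch below assumes that setup; the toy three-dimensional vectors stand in for real CLIP embeddings, and `query_rooms` is a hypothetical helper, not part of the QueSTMaps pipeline.

```python
import numpy as np

def query_rooms(text_embedding, room_embeddings, room_labels):
    """Return the best-matching room label and all cosine similarities."""
    t = text_embedding / np.linalg.norm(text_embedding)
    r = room_embeddings / np.linalg.norm(room_embeddings, axis=1, keepdims=True)
    scores = r @ t
    return room_labels[int(np.argmax(scores))], scores

# Toy stand-ins for CLIP-aligned room features (real ones would be much larger).
room_labels = ["kitchen", "bedroom", "bathroom"]
room_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
])
place_to_cook = np.array([1.0, 0.2, 0.1])  # toy embedding of the query "place to cook"
best_label, scores = query_rooms(place_to_cook, room_embeddings, room_labels)
```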
Efficient Learning of Fast Inverse Kinematics with Collision Avoidance
Tenhumberg, Johannes, Mielke, Arman, Bäuml, Berthold
Fast inverse kinematics (IK) is a central component in robotic motion planning. For complex robots, IK methods are often based on root search and non-linear optimization algorithms. These algorithms can be massively sped up using a neural network to predict a good initial guess, which can then be refined in a few numerical iterations. Going beyond previous work on learning-based IK, we present a learning approach for the fundamentally more complex problem of IK with collision avoidance. We do this in diverse and previously unseen environments. From a detailed analysis of the IK learning problem, we derive a network and unsupervised learning architecture that removes the need for a sample data generation step. Using the trained network's prediction as an initial guess for a two-stage Jacobian-based solver allows for fast and accurate computation of the collision-free IK. For the humanoid robot Agile Justin (19 DoF), the collision-free IK is solved in less than 10 milliseconds (on a single CPU core) and with an accuracy of 10^-4 m and 10^-3 rad based on a high-resolution world model generated from the robot's integrated 3D sensor. Our method massively outperforms a random multi-start baseline in a benchmark with the 19 DoF humanoid and challenging 3D environments. It requires ten times less training time than a supervised training method while achieving comparable results.
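The "good initial guess, then a few numerical iterations" pattern described above can be illustrated on a planar two-link arm. This is a simplified sketch under stated assumptions: it uses a single-stage damped least-squares update with no collision avoidance, whereas the paper describes a two-stage Jacobian-based solver with collision handling, and the initial guess here is a hand-picked stand-in for a network prediction.

```python
import numpy as np

def forward(q, l1=1.0, l2=1.0):
    """End-effector position of a planar two-link arm with joint angles q."""
    return np.array([
        l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
        l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1]),
    ])

def jacobian(q, l1=1.0, l2=1.0):
    """Analytic Jacobian of forward() with respect to q."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([
        [-l1 * s1 - l2 * s12, -l2 * s12],
        [l1 * c1 + l2 * c12, l2 * c12],
    ])

def refine_ik(q_init, target, n_iter=20, damping=1e-3, tol=1e-6):
    """Refine an initial joint guess with damped least-squares steps."""
    q = np.array(q_init, dtype=float)
    for _ in range(n_iter):
        err = target - forward(q)
        if np.linalg.norm(err) < tol:
            break
        J = jacobian(q)
        # Damped pseudo-inverse step: J^T (J J^T + damping * I)^-1 err
        q = q + J.T @ np.linalg.solve(J @ J.T + damping * np.eye(2), err)
    return q

target = forward(np.array([0.5, 1.0]))
q_guess = np.array([0.6, 0.9])  # stand-in for a learned initial guess
q_solution = refine_ik(q_guess, target)
```

A guess this close to a solution converges in a handful of iterations, which is the reason a learned initializer can replace many random restarts.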
Safe Navigation using Density Functions
Zheng, Andrew, Narayanan, Sriram S. K. S., Vaidya, Umesh
This paper presents a novel approach for safe control synthesis using the dual formulation of the navigation problem. The main contribution of this paper is the analytical construction of density functions for almost-everywhere navigation with safety constraints. In contrast to existing approaches, where density functions are used for the analysis of navigation problems, we use density functions for the synthesis of safe controllers. We provide a convergence proof using the proposed density functions for navigation with safety. Further, we use these density functions to design feedback controllers capable of navigating in cluttered environments and high-dimensional configuration spaces. The proposed analytical construction of density functions overcomes the problems associated with navigation functions, which are known to exist but are challenging to construct, and potential functions, which suffer from local minima. Application of the developed framework is demonstrated on simple integrator dynamics and fully actuated robotic systems.
Smoothing the Rough Edges: Evaluating Automatically Generated Multi-Lattice Transitions
Baldwin, Martha, Meisel, Nicholas A., McComb, Christopher
Additive manufacturing is advantageous for producing lightweight components while addressing complex design requirements. This capability has been bolstered by the introduction of unit lattice cells and the gradation of those cells. In cases where loading varies throughout a part, it may be beneficial to use multiple, distinct lattice cell types, resulting in multi-lattice structures. In such structures, abrupt transitions between unit cell topologies may cause stress concentrations, making the boundary between unit cell types a primary failure point. Thus, these regions require careful design in order to ensure the overall functionality of the part. Although computational design approaches have been proposed, smooth transition regions are still difficult to achieve, especially between lattices of drastically different topologies. This work demonstrates and assesses a method for using variational autoencoders to automate the creation of transitional lattice cells, examining the factors that contribute to smooth transitions. Through computational experimentation, it was found that the smoothness of transition regions was strongly predicted by how close the endpoints were in the latent space, whereas the number of transition intervals was not a sole predictor.
Baidu's PP-Matting: Trimap-Free High-Accuracy Natural Image Matting
Differentiating a target foreground subject from its background is a fundamental computer vision task with widespread applications in image editing and composition. Basic segmentation approaches that use a binary pixel classification scheme do not consider the varying opacity in foreground/background edge pixels, resulting in hard and unnaturally contrastive edges around the foreground subject. Although recent deep learning-based natural image matting techniques have been shown to significantly improve fine-grained detail in these areas by estimating per-pixel opacity of the target foreground, these techniques rely on user-supplied trimaps as an auxiliary input, which limits their real-world applicability. In the new paper PP-Matting: High-Accuracy Natural Image Matting, a Baidu research team proposes PP-Matting, a trimap-free architecture that combines a high-resolution detail branch and a semantic context branch to achieve state-of-the-art performance on natural image matting tasks. In an input image comprising a target foreground subject and a background, the colour of each pixel is formulated as a linear combination of the foreground and background colours, while an alpha matte defines each pixel's relative opacity.
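The linear combination mentioned in the last sentence is the standard compositing equation, I = alpha * F + (1 - alpha) * B, applied per pixel. A minimal sketch follows; the array shapes and the name `composite` are choices made here for illustration. A matting model solves the inverse problem, estimating alpha from I.

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend (H, W, 3) foreground and background using an (H, W) alpha matte in [0, 1]."""
    a = alpha[..., None]  # broadcast the matte over the colour channels
    return a * foreground + (1.0 - a) * background

fg = np.full((2, 2, 3), 1.0)   # white foreground
bg = np.zeros((2, 2, 3))       # black background
alpha = np.array([[1.0, 0.5],
                  [0.5, 0.0]])  # opaque, two semi-transparent edge pixels, transparent
image = composite(fg, bg, alpha)
```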
Smooth activations and reproducibility in deep networks
Shamir, Gil I., Lin, Dong, Coviello, Lorenzo
Deep networks are gradually penetrating almost every domain in our lives due to their amazing success. However, with substantive performance accuracy improvements comes the price of irreproducibility. Two identical models trained on the exact same training dataset may exhibit large differences in predictions on individual examples even when average accuracy is similar, especially when trained on highly distributed parallel systems. The popular Rectified Linear Unit (ReLU) activation has been key to the recent success of deep networks. We demonstrate, however, that ReLU is also a catalyst for irreproducibility in deep networks. We show that not only can activations smoother than ReLU provide better accuracy, but they can also provide better accuracy-reproducibility tradeoffs. We propose a new family of activations, Smooth ReLU (SmeLU), designed to give such better tradeoffs while also keeping the mathematical expression simple, and thus the implementation cheap. SmeLU is monotonic and mimics ReLU while providing continuous gradients, yielding better reproducibility. We generalize SmeLU to give even more flexibility and then demonstrate that SmeLU and its generalized form are special cases of a more general methodology of REctified Smooth Continuous Unit (RESCU) activations. Empirical results demonstrate the superior accuracy-reproducibility tradeoffs with smooth activations, SmeLU in particular.
Recent developments in deep learning leave no question about the advantages of deep networks over classical methods, which relied heavily on linear convex optimization solutions. With their astonishing, unprecedented success, deep models are providing solutions to a continuously increasing number of domains in our lives. These solutions, however, while much more accurate than their convex counterparts, are usually irreproducible in the predictions they provide.
While the average accuracy of deep models on a validation dataset is usually much higher than that of linear convex models, the predictions of two models that were trained to be identical may diverge substantially on individual examples, exhibiting Prediction Differences that may be as high as non-negligible fractions of the actual predictions (see, e.g., Chen et al. (2020); Dusenberry et al. (2020)). Deep networks express (only) what they learned.
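The excerpt above does not spell out the SmeLU expression. As commonly stated, it is a piecewise function with a quadratic segment joining the two linear pieces of ReLU: 0 for x <= -beta, (x + beta)^2 / (4 * beta) for |x| <= beta, and x for x >= beta. The sketch below assumes that form and a half-width parameter named `beta`; treat it as an illustration rather than the authors' reference code.

```python
import numpy as np

def smelu(x, beta=1.0):
    """Smooth ReLU: zero, then a quadratic blend on [-beta, beta], then identity."""
    x = np.asarray(x, dtype=float)
    quad = (x + beta) ** 2 / (4.0 * beta)
    return np.where(x <= -beta, 0.0, np.where(x >= beta, x, quad))

def smelu_grad(x, beta=1.0):
    """Derivative of smelu: continuous, rising linearly from 0 to 1 on [-beta, beta]."""
    x = np.asarray(x, dtype=float)
    return np.clip((x + beta) / (2.0 * beta), 0.0, 1.0)
```

Unlike ReLU, whose gradient jumps from 0 to 1 at the origin, this gradient has no discontinuity, which is the property the abstract ties to better reproducibility. As `beta` shrinks toward 0 the function approaches ReLU.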